Abstract 


The ability to combine symbols to generate 
language is a defining characteristic of human 
intelligence, particularly in the context of artis- 
tic story-telling through lyrics. We develop 
a method for synthesizing a rap verse based 
on the content of any text (e.g., a news arti- 
cle), or for augmenting pre-existing rap lyrics. 
Our method, called RAPFORMER, is based 
on training a Transformer-based denoising au- 
toencoder to reconstruct rap lyrics from con- 
tent words extracted from the lyrics, trying to 
preserve the essential meaning, while match- 
ing the target style. RAPFORMER features 
a novel BERT-based paraphrasing scheme for 
rhyme enhancement which increases the aver- 
age rhyme density of output lyrics by 10%. Ex- 
perimental results on three diverse input do- 
mains show that RAPFORMER is capable of 
generating technically fluent verses that of- 
fer a good trade-off between content preserva- 
tion and style transfer. Furthermore, a Turing- 
test-like experiment reveals that RAPFORMER 
fools human lyrics experts 25% of the time.¹


1 Introduction 


Automatic lyrics generation is a challenging lan- 
guage generation task for any musical genre, requir- 
ing story development and creativity while adher- 
ing to the structural constraints of song lyrics. Here 
we focus on the generation of rap lyrics, which 
poses three additional challenges specific to the rap 
genre: (i) a verse in rap lyrics often comprises multiple
rhyme structures which may change throughout
a verse (Bradley, 2017), (ii) the number of
words in a typical rap verse is significantly larger 
when compared to other music genres (Mayer et al., 
2008), requiring modeling of long-term dependen- 
cies, and (iii) the presence of many slang words. 
¹ We created a song with lyrics generated by RAPFORMER

[Figure 1 diagram. Training: (1) extract content words from an existing rap verse using the stripping approach; a sequence model then reconstructs the verse from the content words. Inference: (3) input novel content. Application 1: style transfer (e.g., using a news article input), producing a novel output rap verse. Application 2: rap reconstruction (using existing rap lyrics as input), producing augmented rap lyrics.]

Figure 1: Overview of our approach to conditional rap lyrics 
generation. Training: (1) extract content words from existing 
rap verses, then (2) train sequence models to guess the original 
verses conditioned on the content words. Inference: (3) Input 
content from non-rap texts to produce content-controlled rap 
verses; or input existing rap verses to augment them. 


Prior approaches to rap generation typically 
use unconditional generation (Potash et al., 2015; 
Malmi et al., 2016). That approach synthesizes 
lyrics without providing any context that could be 
useful to guide the narrative development into a 
coherent direction (Dathathri et al., 2020). For ex- 
ample, generating rap lyrics on a specific topic, 
e.g., "cooking," is not possible with unconditional
generation. Motivated by this, in this paper, we pro- 
pose a novel approach for conditional generation 
of rap verses, where the generator is provided a 
source text and tasked with transferring the style of 
the text into rap lyrics. Compared to unconditional 
generation, this task can support the human cre- 
ative process more effectively as it allows a human 
writer to engage with the generator by providing 
the content of the lyrics while receiving automatic 
suggestions on how to improve the style of the 
lyrics to resemble the rap domain.

Our approach to conditional generation is to
train sequence-to-sequence models (Vaswani et al., 
2017) to reconstruct existing rap verses conditioned 
on a list of content words extracted from the verses 
(Figure 1). By learning a mapping from content 
words to complete verses, we implicitly learn the 
latent structure of rap verses given content, while 
preserving the target output style of the rap lyrics. 
Model outputs are enhanced by a post-processing 
step (Section 3.2) that substitutes non-rhyming end- 
of-line words with suitable rhyming alternatives. 

We test our method on three diverse input do- 
mains: short summaries of news articles, movie 
plot summaries, and existing rap lyrics. Automatic 
and human evaluations (Sections 5 and 6) suggest 
that our method provides a trade-off between con- 
tent preservation and style compared to a strong 
information retrieval baseline. 


2 Background 


2.1 Rap Lyrics Generation 


Prior work on rap lyrics generation often focuses 
on unconditional generation, either using language 
models (Potash et al., 2015) or by stitching together 
lines from existing rap lyrics using information re- 
trieval methods (Malmi et al., 2016). There are two 
main drawbacks of unconditional generation of rap 
lyrics. First, the open-ended nature of the task is 
too unconstrained for generating lyrics with more 
specific content: ideally, we may want to have con- 
trol over at least some aspects of the model during 
inference, such as the topic of the lyrics, or their 
sentiment. Second, although frequent rhyming is 
an essential feature of fluent rap verses (Malmi 
et al., 2016), language models have no built-in in- 
centive to learn to consistently generate rhymes at 
the end of each line, prompting researchers to in- 
vent techniques to promote rhyming in their models 
separately (Hopkins and Kiela, 2017). 

More recently, Manjavacas et al. (2019) propose 
a conditional approach to rap lyrics generation, 
which extracts high-level features from the lyrics, 
such as their sentiment, mood, or tense, to provide 
a template during generation. Although their ap- 
proach allows for some control during generation, 
it is limited in terms of generating lyrics with more 
specific content. The work that is closest to ours 
is that of Lee et al. (2019), who propose an approach to
sentence style transfer based on text denoising, and 
test their approach on style transfer from pop to 
rap lyrics. In contrast to these works, we condition
the model on longer input text and also introduce
a novel method for enhancing the rhymes of our 
output verses. We also perform extensive auto- 
matic and human evaluations on style transfer from 
diverse input domains to rap lyrics. 


2.2 Text Rewriting and Style Transfer 


Recent work on style transfer of text (Fu et al., 
2018; Shen et al., 2017; Prabhumoye et al., 2018; 
Lample et al., 2019; Liu et al., 2019) focuses on
transfer from one text attribute to another, such 
as gender or political inclination. The main dif- 
ference between such studies and our work is that 
our setting is more lenient with respect to mean- 
ing preservation: our focus here is on generating 
creative and fluent verses that match the overall 
topic of the input and also preserve some of the 
content. Our conditional lyrics generation based 
on denoising autoencoders is also related to recent 
work on self-supervised pre-training objectives for 
text-to-text generation tasks, which have been ben- 
eficial for many NLP tasks, such as automatic text 
summarization (Zhang et al., 2020), question an- 
swering (Lewis et al., 2020; Raffel et al., 2019), and 
data-to-text generation (Freitag and Roy, 2018). 


3 Conditional Generation of Lyrics 


Our approach to conditional generation of rap 
verses consists of three steps (Figure 1). 


1. Given a dataset of rap verses, we apply a strip- 
ping approach to extract from each verse a 
set of content words that aim to resemble the 
main content of the original text, omitting any 
specific stylistic information. 


2. We train a Transformer model (Vaswani et al., 
2017) to reconstruct the original rap verses 
conditioned on the content words. The model 
learns to generate the original verse, filling in 
missing stylistic information. 


3. At inference time, we can input content words 
extracted from a text written in any style, such 
as a news article, resulting in novel output 
rhyme verses. After generation, we option- 
ally apply a rhyme enhancement step (Section 
3.2). 


3.1 Stripping Approach 


Given a dataset of original rap verses, our base
approach to extracting content words involves
preprocessing each verse to remove all stop words²,
numbers, and punctuation. To promote greater
novelty³ and variability in the outputs produced by our
models, we additionally apply one of three noise
types to the stripped content words:


Shuffle. We shuffle all of the content words on 
the sentence level (line level for rap verses). This 
type of noise forces our models to learn to rearrange 
the location of the input content words when gen- 
erating the output rap lyric, rather than to merely 
copy words from the input in an identical order. 
A similar noising approach has been recently em- 
ployed by Raffel et al. (2019). 


Drop. We randomly remove 20% of the input 
content words for the purpose of promoting gen- 
eration of novel words, rather than only copying 
content words from the input. 


Synonym. We replace 20% of the content words 
with synonyms obtained from WordNet (Miller, 
1995). We pick words randomly and replace them 
with a random synonym. This type of noise encourages
our models to learn to replace content words
with synonyms, which might fit better in the con- 
text or style of the current output rap verse. 
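The stripping step and the three noise types above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word set is a small stand-in for the NLTK list, the synonym dictionary passed to `synonym_noise` stands in for WordNet, and dropping or replacing each word independently with probability 0.2 only approximates "20% of the content words". All function names are hypothetical.

```python
import random
import string
from typing import Dict, List

# Stand-in stop-word list (assumption); the paper uses NLTK's English list.
STOP_WORDS = {"this", "is", "a", "i", "to", "some", "what", "you", "was", "than", "my"}

Verse = List[List[str]]  # one list of content words per line


def strip_line(line: str) -> List[str]:
    """Keep content words: drop stop words, tokens with digits, and punctuation."""
    words = []
    for raw in line.lower().split():
        token = raw.strip(string.punctuation)
        if token and token not in STOP_WORDS and not any(c.isdigit() for c in token):
            words.append(token)
    return words


def shuffle_noise(verse: Verse, rng: random.Random) -> Verse:
    """Shuffle the content words within each line."""
    return [rng.sample(line, len(line)) for line in verse]


def drop_noise(verse: Verse, rng: random.Random, p: float = 0.2) -> Verse:
    """Drop each content word independently with probability p."""
    return [[w for w in line if rng.random() >= p] for line in verse]


def synonym_noise(verse: Verse, rng: random.Random,
                  synonyms: Dict[str, List[str]], p: float = 0.2) -> Verse:
    """Replace a word by a random known synonym with probability p."""
    return [
        [rng.choice(synonyms[w]) if w in synonyms and rng.random() < p else w
         for w in line]
        for line in verse
    ]


rng = random.Random(0)
verse = [strip_line(text) for text in [
    "This is a job, I get paid to sling some raps,",
    "What you made last year was less than my income tax",
]]
print(verse)                      # stripped content words per line
print(shuffle_noise(verse, rng))  # same words, reordered within each line
```

With this stand-in stop-word list, the two lines reduce to the content words shown in Figure 1 (job get paid sling raps / made last year less income tax).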


3.2 Rhyme Enhancement with BERT 


To improve the rhyming fluency of our models, 
we implement a post-processing step for rhyme en- 
hancement (RE) which modifies a generated verse 
to introduce additional end-of-line rhymes. Given 
two lines from a generated verse, such as: 


where were you? 
last year i was paid in a drought with no beginners 


RE iterates over each of the lines in the verse,
replacing the ending words with a MASK token. The
verse is then passed through a BERT model⁴ (Devlin
et al., 2019) which predicts the K = 200 most
likely replacement candidates for MASK. For example,
the replacement candidates for you might be
{they, we, i, it}, and for beginners might be {food,
fruit, you, rules}. We pick the candidate that leads
to the highest increase in rhyming, determined by 
the length of the longest overlapping vowels in the 


² We use the list of English stopwords defined in NLTK.

³ In early experiments, we tested training models using
only this base approach. The models performed very well
at reconstructing existing rap lyrics, however when the input
was from a different domain, we observed very conservative
outputs.

⁴ We finetune a BERT base model on our rap verse dataset
for 20 epochs.

Algorithm 1: BERT Rhyme Enhancement

input:  lyrics verse V = {l_0, ..., l_N} consisting of
        N tokenized lines; number of BERT
        predictions K to consider.
output: modified V with enhanced rhyming.

Function get_rhyming_replacement(V, src_idx, tgt_idx, mask):
    src ← V[src_idx][-1]   // get last word
    tgt ← V[tgt_idx][-1]
    // Predict most likely words.
    preds ← bert_predictions(mask, K)
    // Compute original rhyme length.
    rl_orig ← rhyme_length(src, tgt)
    for pred ∈ preds do
        rl_new ← rhyme_length(pred, tgt)
        if rl_new > rl_orig then
            // return replacement
            return pred, rl_new
    return tgt, rl_orig   // return original

for i ← 1, 3, ..., N do   // for each odd line
    // Create two masks for the two consecutive lines.
    mask_1 ← mask_text(V, i)
    mask_2 ← mask_text(V, i + 1)
    // Generate replacement candidates.
    cand_1, rl_1 ← get_rhyming_replacement(V, i + 1, i, mask_1)   // replace last word at i
    cand_2, rl_2 ← get_rhyming_replacement(V, i, i + 1, mask_2)   // replace last word at i + 1
    if rl_2 > rl_1 then   // update lines in V
        V[i + 1][-1] ← cand_2
    else
        V[i][-1] ← cand_1
return V


two words (Malmi et al., 2016). In the example
above, replacing beginners with food maximizes
the rhyme length, and the example becomes:

where were you?
last year i was paid in a drought with no food

Algorithm 1 contains a detailed implementation
of our approach.
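The control flow of the rhyme enhancement step can be illustrated with a small, self-contained sketch. Several parts are assumptions made only for illustration: `TOY_VOWELS` is a hand-written vowel table standing in for a real phonetic transcription, `rhyme_length` counts matching vowels from the end of the two words, and `toy_predict` returns fixed candidate lists in place of the finetuned BERT model. The sketch compares each candidate with the ending word of the neighbouring line, which is our reading of the intent, and, like Algorithm 1, returns the first candidate that improves the rhyme length.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical vowel transcriptions for the words in the running example.
TOY_VOWELS: Dict[str, List[str]] = {
    "you": ["UW"], "food": ["UW"], "fruit": ["UW"], "rules": ["UW"],
    "beginners": ["IH", "IH", "ER"], "they": ["EY"], "we": ["IY"],
    "i": ["AY"], "it": ["IH"],
}


def rhyme_length(a: str, b: str) -> int:
    """Number of matching vowels, counted from the end of both words."""
    va, vb = TOY_VOWELS.get(a, []), TOY_VOWELS.get(b, [])
    n = 0
    while n < min(len(va), len(vb)) and va[-1 - n] == vb[-1 - n]:
        n += 1
    return n


def best_replacement(verse: List[List[str]], fix_idx: int, other_idx: int,
                     candidates: List[str]) -> Tuple[str, int]:
    """Find a new last word for line fix_idx that rhymes better with line other_idx."""
    current, anchor = verse[fix_idx][-1], verse[other_idx][-1]
    base = rhyme_length(current, anchor)
    for cand in candidates:  # candidates are assumed sorted by model likelihood
        new = rhyme_length(cand, anchor)
        if new > base:
            return cand, new
    return current, base  # keep the original word


def enhance_rhymes(verse: List[List[str]],
                   predict: Callable[[List[List[str]], int], List[str]]) -> List[List[str]]:
    """Process consecutive line pairs and replace one ending word per pair."""
    out = [list(line) for line in verse]
    for i in range(0, len(out) - 1, 2):
        cand_1, rl_1 = best_replacement(out, i, i + 1, predict(out, i))
        cand_2, rl_2 = best_replacement(out, i + 1, i, predict(out, i + 1))
        if rl_2 > rl_1:
            out[i + 1][-1] = cand_2
        else:
            out[i][-1] = cand_1
    return out


def toy_predict(verse: List[List[str]], line_idx: int) -> List[str]:
    """Stand-in for masked-word predictions on the ending word of a line."""
    table = {0: ["they", "we", "i", "it"], 1: ["food", "fruit", "you", "rules"]}
    return table.get(line_idx, [])


verse = [
    "where were you".split(),
    "last year i was paid in a drought with no beginners".split(),
]
for line in enhance_rhymes(verse, toy_predict):
    print(" ".join(line))
```

On the running example the sketch keeps "you" in the first line and replaces "beginners" with "food" in the second, matching the example in the text.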


4 Experimental Setup 


Datasets. We conduct experiments using three 
datasets. As our rap dataset, we use 60k English 
rap lyrics provided by Musixmatch.⁵

We split each lyric into verses (in the dataset, 
each verse is separated by a blank line), remove 


⁵ https://www.musixmatch.com/


             News           Movies        Rap
# Pairs      287k/11k/11k   -/-/12k       165k/1k/1k
Sent. p.d.   3.7 ± 1.2      3.9 ± 1.6     10.5 ± 4.5
Tok. p.d.    57.9 ± 24.3    90 ± 27.6     91.8 ± 49.1
Tok. p.s.    15.1 ± 4.7     22.44 ± 11    9.5 ± 4.25


Table 1: Statistics of our datasets. # Pairs denotes the 
number of pairs used for training/validation/testing; p.d. 
is per document; p.s. is per sentence. 


verses shorter than 4 lines in order to filter out song
choruses and intros, and reserve 2k song lyrics
for validation and testing. We use two datasets as 
our out-of-domain inputs: (1) the summaries from 
the CNN/DailyMail news summarization dataset 
(Hermann et al., 2015) and (2) a subset of the CMU 
movie plot summary corpus (Bamman et al., 2013). 
Since some of the movie summaries are very long, 
for this dataset, we filter out summaries longer than
140 tokens or shorter than 40 tokens. Table 1
contains detailed statistics of the datasets used for 
training/validation/testing in our experiments. 
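The verse splitting and length filters described above could be sketched as follows. This is only an illustration under stated assumptions: tokens are counted by whitespace splitting, which may differ from the tokenization behind Table 1, and the function names are hypothetical.

```python
from typing import List


def split_into_verses(lyric: str, min_lines: int = 4) -> List[List[str]]:
    """Split a lyric on blank lines and keep verses with at least min_lines lines."""
    verses, current = [], []
    for line in lyric.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            verses.append(current)
            current = []
    if current:
        verses.append(current)
    return [verse for verse in verses if len(verse) >= min_lines]


def keep_summary(summary: str, min_tokens: int = 40, max_tokens: int = 140) -> bool:
    """Keep movie summaries whose whitespace token count lies in [min_tokens, max_tokens]."""
    n_tokens = len(summary.split())
    return min_tokens <= n_tokens <= max_tokens


lyric = "a\nb\nc\nd\n\nhook\nhook\n\ne\nf\ng\nh\ni"
print(split_into_verses(lyric))  # the two-line "hook" block is filtered out
```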


Model details. As our sequence transducer, we 
use a 6-layer Transformer encoder-decoder model 
(Vaswani et al., 2017). We initially train our mod- 
els on the source domain (e.g., news articles) for 20 
epochs, after which we finetune them on rap verses 
for an additional 20 epochs, using the same strip- 
ping approach for both. We train all of our models 
on the subword level (Sennrich et al., 2016), ex- 
tracting a common vocabulary of 50k tokens from 
a joint collection of news summaries and rap lyrics. 
We use the same vocabulary for both our encoders 
and decoders and use the Fairseq library.⁶ We train
all of our models on a single GTX 1080 Ti card. 


Generation details. During inference, we gener- 
ate outputs using diverse beam search (Vijayaku- 
mar et al., 2018) to promote greater diversity across 
the hypothesis space. We use a beam with a size 
of 24 and 6 diverse beam groups. Furthermore, we 
limit the maximum output sequence length to two 
times the length of the input content words and 
penalize repetitions of bigrams in the outputs. 

To select our final output, we additionally imple- 
ment a simple hypothesis reranking method. For 
each of the 24 final predictions on the beam, we 
compute two scores: the rhyme density (RD) of 
the text, following Malmi et al. (2016), as well as


⁶ https://github.com/pytorch/fairseq


its repetition score: 


rep(s) = \frac{1}{|s|} \sum_{i=1}^{|s|} \mathrm{overlap}(s_i, \bar{s}_i)    (1)


rep measures the average unigram overlap (see
Section 5.1) of each sentence s_i in the text s with
all other sentences of the text concatenated into a
single string (denoted as \bar{s}_i). We pick the hypothesis
that maximizes: score(s) = RD(s) - rep(s).
Afterwards, we optionally apply our rhyme en- 
hancement step, to further increase the frequency 
of rhymes in our outputs. 
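The reranking rule can be sketched in a few lines. This is a minimal illustration under assumptions: hypotheses are lists of lines, tokens are obtained by whitespace splitting, the unigram overlap follows the definition in Section 5.1 (normalized by the unique unigrams of its second argument), and the rhyme density is supplied by the caller as a function, since computing it requires a phonetic transcription. The names `rep`, `rerank`, and `rhyme_density` are illustrative.

```python
from typing import Callable, List, Sequence


def overlap(x: Sequence[str], y: Sequence[str]) -> float:
    """Share of unique unigrams of y that also occur in x."""
    unique_y = set(y)
    return len(unique_y & set(x)) / len(unique_y) if unique_y else 0.0


def rep(lines: Sequence[str]) -> float:
    """Average overlap of each line with all other lines concatenated."""
    if not lines:
        return 0.0
    total = 0.0
    for i, line in enumerate(lines):
        others = " ".join(l for j, l in enumerate(lines) if j != i)
        total += overlap(line.split(), others.split())
    return total / len(lines)


def rerank(hypotheses: List[List[str]],
           rhyme_density: Callable[[List[str]], float]) -> List[str]:
    """Pick the hypothesis maximizing score(s) = RD(s) - rep(s)."""
    return max(hypotheses, key=lambda h: rhyme_density(h) - rep(h))


hypotheses = [["a b", "a b"], ["a b", "c d"]]
# With a constant rhyme density, the less repetitive hypothesis wins.
print(rerank(hypotheses, rhyme_density=lambda h: 1.0))
```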


Bias mitigation. Rap lyrics, like other human-produced
texts, may contain harmful biases and
offensive content which text generation models 
should not propagate further. Our conditional lyrics 
generation setup is less susceptible to this issue 
since the user provides the content, and the gen- 
erator is supposed to modify only the style of the 
text. Yet, the model may learn to use inappropriate 
individual terms that are common in rap lyrics. To 
alleviate this, we maintain a deny list of words that 
the model is not able to generate. 


5 Automatic Evaluation 


We conduct an automatic evaluation of RAP- 
FORMER, using the test sets of each of our three 
datasets. Our focus is on measuring two compo- 
nents that are important for generating fluent condi- 
tional rap verses: preserving content from the input 
text to the output, and maintaining rhyming fluency 
during generation. 


5.1 Evaluation Metrics 


Content preservation. We test the capacity of 
our models to preserve content words from the 
input by computing a unigram overlap score: 


\mathrm{overlap}(x, y) = \frac{|\{y\} \cap \{x\}|}{|\{y\}|}    (2)

between unique unigrams from an input text x and 
the generated output rap verse y. We also report the 
BLEU score (Papineni et al., 2002) when training 
a model to reconstruct original lyrics. 


Rhyming fluency. We measure the technical 
quality of our rap verses using the rhyme density 
(RD) metric (Malmi et al., 2016).⁷ The metric is
based on computing a phonetic transcription of the 


⁷ https://github.com/ekQ/raplysaattori

               Rap reconstruction                    Style transfer from movies   Style transfer from news
Model          BLEU    Overlap       RD              Overlap       RD             Overlap       RD
INPUTS         -       -             0.84 ± 0.38     -             0.73 ± 0.2     -             0.72 ± 0.21
IR NEWS        -       -             -               -             -              0.29 ± 0.09   0.74 ± 0.19
IR RAP         -       -             -               0.19 ± 0.06   1.02 ± 0.23    0.17 ± 0.06   1.01 ± 0.24
SHUFFLE        10.27   0.63 ± 0.13   1.01 ± 0.31     0.51 ± 0.11   0.90 ± 0.23    0.45 ± 0.12   0.89 ± 0.26
SHUFFLE + RE   12.72   0.60 ± 0.12   1.10 ± 0.32     0.49 ± 0.10   0.96 ± 0.27    0.43 ± 0.11   0.98 ± 0.27
DROP           11.06   0.52 ± 0.11   1.03 ± 0.32     0.43 ± 0.10   0.90 ± 0.24    0.38 ± 0.10   0.93 ± 0.25
DROP + RE      9.81    0.50 ± 0.11   1.13 ± 0.33     0.40 ± 0.09   0.99 ± 0.27    0.36 ± 0.10   1.03 ± 0.26
REPLACE        14.30   0.57 ± 0.15   1.00 ± 0.30     0.43 ± 0.14   0.86 ± 0.28    0.34 ± 0.13   0.95 ± 0.27
REPLACE + RE   12.72   0.54 ± 0.15   1.10 ± 0.31     0.40 ± 0.13   0.98 ± 0.24    0.31 ± 0.12   1.05 ± 0.28


Table 2: Automatic metric results of RAPFORMER, using three alternative stripping approaches: SHUFFLE, DROP 
and REPLACE. Model names ending in "+ RE" denote use of the additional rhyme enhancement step (see Section
3.2). INPUTS measures the result of the original input texts, for each of the three inputs (rap/movies/news). Overlap
is the content preservation score, RD is the rhyme density metric. The highest results for each column are in bold. 


lyrics and finding the average length of matching 
vowel sound sequences which resemble multisyl- 
labic assonance rhymes. As a reference, RD values 
above 1 can be considered high, with some rap 
artists reaching up to 1.2. 


5.2 Baselines 


For reference, we report the result of an informa- 
tion retrieval baseline, which retrieves the closest 
text from our training dataset given input from the 
news or movies test sets, using sentence embedding
similarity.⁸ We report two variants of the IR
baseline. First, we retrieve the closest summary 
from the CNN/DailyMail news training set (IR 
NEWS), which resembles a lower bound for our 
target task of style transfer from news to rap lyrics. 
Second, we retrieve the closest verse from our rap 
training set (IR RAP). The outputs of the strong 
IR RAP baseline perfectly match the style of original
rap verses, giving us an upper bound for rap
style, while maintaining some degree of lexical and 
semantic overlap with the input texts. 


5.3 Results 


Our results are shown in Table 2, where we include 
all of our stripping approaches (Shuffle, Drop, Re- 
place). We report the results of applying the addi- 
tional rhyme enhancement step separately (model 
names ending with "+ RE").


Rap reconstruction. In the left part of Table 2, 
we evaluate our model’s capacity to reliably re- 
generate original rap lyrics given extracted content 
words from them. RAPFORMER performed well on 
this task, generating fluent lyrics that incorporate a 
large part of the input content words and surpassing 


⁸ We use a 600-dimensional Sent2Vec model (Pagliardini
et al., 2018), which is pretrained on Wikipedia.

the average rhyme density observed in the training 
dataset (INPUTS). When using our rhyme enhance- 
ment step, we observe a slight decrease in overlap 
due to the potential replacement of content words. 
However, RD increases by 10% on average. 


Style transfer. In the right part of Table 2, we 
evaluate the capacity of our model to generate rap 
lyrics using content words extracted from movie 
plot summaries or news article summaries. For 
these inputs, our model generated outputs with 
lower overlap on average than for rap reconstruc- 
tion, with movies retaining slightly more content 
than news. This gap is potentially due to the large 
differences in style, vocabulary, and topic of the 
inputs, prompting our models to ignore some of the 
content words to better match the target rap style. 
Still, our generation methods manage to achieve 
similar RD scores while considerably outperform- 
ing the strong IR baseline in terms of overlap. 


6 Human Evaluation 


Due to the limitations of automatic metrics for text 
generation, we also perform four human evalua- 
tion experiments using three raters, who are trained 
to translate lyrics. Due to limited resources, we 
evaluate only the RAPFORMER variant with the 
SHUFFLE stripping approach and rhyme enhance- 
ment, which achieved the highest content overlap 
in our automatic evaluation. 

The first two human experiments (in Table 3) 
focus on style transfer using news articles as in- 
put. Each rater inspected 100 verses produced by 
either RAPFORMER or one of the two IR baselines,
answering the following three questions: 


1. How much do the lyrics presented resemble
rap lyrics? On a scale from 1 (not at all),


Method      Style   Meaning   Familiarity
IR NEWS     1.18    2.01      1%
IR RAP      4.27    1.33      31%
RAPFORMER   2.03    2.55      8%


Table 3: Human evaluation results of RAPFORMER (us- 
ing the SHUFFLE stripping approach, and news articles 
as input). The average inter-rater agreement for Style is 
0.3, and for Meaning is -0.1, measured using Cohen’s
Kappa (Cohen, 1960). 


to 5 (this could be from existing rap lyrics), 
which measures the capacity of our models to 
preserve the Style. 


2. How well do the lyrics preserve the content of 
the original news article on a scale from 1 (not 
at all) to 5 (very well)? This question mea- 
sures the meaning preservation of our models 
(Meaning). 


3. Do these lyrics look like a song you know (yes 
or no)? For IR RAP, this question measures 
the Familiarity of the raters with the lyrics; 
for the other two methods, it measures the 
capacity to fool the raters. 


Method      Side-by-Side   Random
RAPFORMER   7%             25%


Table 4: Turing-like evaluation, reporting the percent- 
age of lyrics generated by RAPFORMER (using the 
SHUFFLE stripping approach, and rap lyrics as input) 
that human experts incorrectly label as existing rap 
lyrics. The average inter-rater agreement for Side-by- 
Side is 0.8, and for Random is 0.4, measured using 
Cohen’s Kappa (Cohen, 1960). 


The other two human experiments (in Table 4) 
focus on our rap reconstruction task, performing 
two Turing-test-like comparisons between 100 real 
and synthetic verses: 


1. Side-by-Side: the original rap lyrics and RAP- 
FORMER lyrics are presented side-by-side, in 
a random order, and a rater is asked, "Which of
these lyrics was written by a human?" (see the
Appendix for examples). 


2. Random: a verse is shown and the rater is
asked, "Do you think these rap lyrics are: (a)
AI-generated or (b) human-created?"


In terms of style (Table 3), we outperform IR 
NEWS, demonstrating that there is a change in style

INPUT (RAP): before the dough came , my whole aim , 
was blow like propane 

control the whole domain , and then show no shame 
make rappers go ? and they so lame , playing with no game 
put em on the lil plane til they can ’t claim no fame 

i got , the range , better , stay in the slow lane 

i make the flow change from hurricanes to a slow rain 
your thoughts are so plane , i rearrange your whole frame 
until my whole name grow out your brain like rogaine 


SHUFFLE: aim dough like propane came with a whole 
blow 

shame how you control the whole show 

lame rappers playin’ the game, make your domain go 
can’t claim em til you put a lil fame on a plane 

slow down, stay in your lane, got a better range 

make it rain, change your flow, slow down 

rearrange your whole frame, plane thoughts 

grow like rogaine, put my whole name on your brain 


SHUFFLE + RE: aim dough like propane came with a 
whole blow 

shame how you control the whole show 

lame rappers playin’ the game, make your domain go 
can’t claim em til you put a lil fame on a show 

slow down, stay in your lane, got a better range 

make it rain, change your flow, slow rain 

rearrange your whole frame, plane change 

grow like rogaine, put my whole name on your brain 


Table 5: Example model output for rap reconstruction. 
Words replaced by our rhyme enhancement step are in 
bold. The input lyrics are from the song How I Get 
Down by Rakim. 


towards rap verses. There is still a large gap to 
reach the fluency of original rap verses retrieved 
by IR RAP. However, it is worth noting that the 
content preservation of IR RAP is considerably 
lower, as shown in Tables 2 and 3, and simply the 
fact that the content of the generated lyrics is closer 
to the news domain might encourage the raters 
to rate the generated lyrics as having a lower rap 
resemblance score. In other words, the style score 
of IR RAP might be unrealistic to attain even with 
a perfect conditional generator. 


Overall, the results indicate that our method pro- 
vides a trade-off between the two baselines in terms 
of style while outperforming them in terms of con- 
tent overlap. Furthermore, 8% of the time, our 
conditional generation model fooled experienced
raters into thinking that our synthetic rap lyrics generated
from news articles originate from real rap
songs. Our rap lyrics augmentation approach also 
proved to be robust in the Turing-style evaluation of 
rap reconstruction (Table 4), where RAPFORMER 
fooled the raters 25% of the time when lyrics from 
a random source are presented one-by-one, and 7% 


INPUT (MOVIES): the film follows the lives of several west
point cadet classmates who find themselves on opposite sides
of the war . the film also follows the adventures of lucius the
slave escaping via the underground railroad to freedom with
the film cutting between the first battle of bull run and the
birth of a lucius ' child born in slavery .


SHUFFLE: this is the opposite of war follows lives on both 
sides 

several point film from the west to the wrong 

find a child born escaping via film 

film the underground cutting off the film of all the complica- 
tions 

slave, run from lucius slavery 

battle of freedom and birth 

also the first bull follows luc-up! 


SHUFFLE + RE: this is the opposite of war follows lives on 
both sides 

several point film from the west to the light 

find a child born escaping via immigration 

film the underground cutting off the film of all the complica- 
tions 

slave, run from lucius slavery 

battle of freedom and liberty 

also the first bull follows luc-up! 


Table 6: Example model outputs for style transfer from 
movie plot summaries. Words replaced by our rhyme 
enhancement step are in bold. 


INPUT (NEWS): temperatures dipped into the mid-30s during
4 days man lay in woods of philadelphia park . mom told
police son was with her in maryland , but he was found friday
with blanket , bible . victim being treated for malnutrition ,
dehydration ; mother faces host of charges after extradition .


SHUFFLE: man i was dipped up in a lay up with some of 
them from an old 

mid-30s days in the park 

mom told me to be in michigan woods 

police blanket friday 

i found my son a bible 

he was a host for the charges 

my mother treated him as an age 

a victim of faces 


SHUFFLE + RE: man i was dipped up in a lay up with some 
of them from an old 

mid-30s days in the home 

mom told me to be in michigan anyway 

police blanket friday 

i found my son a bible 

he was a host for the trial 

my mother treated him as an alien 

a victim of faces 


Table 7: Example model outputs for style transfer from 
news articles. Words replaced by our rhyme enhance- 
ment step are in bold. 


of the time when lyrics are presented side-by-side. 


7 Example Model Outputs 


In Tables 5, 6 and 7, we also display a few manually 
selected example model outputs (additional exam- 
ples are available in the Appendix) produced after 
inputting content words extracted from each of our 
input text styles (existing rap lyrics, movie plot 
summaries and news article summaries). When 
using existing rap lyrics as input, many outputs 
seem coherent and of higher quality in comparison 
to outputs produced using news/movie inputs. For 
news/movie inputs, the models are still capable of 
integrating the input content words into a rhyming 
verse that preserves some of the overall meaning 
of the original text (e.g., "the film also follows the
adventures of lucius the slave escaping via the underground
railroad to freedom" → "slave, run from
lucius slavery; battle of freedom and liberty").

Furthermore, in Table 8 we present examples 
from our side-by-side Turing test, where we asked 
raters to choose which of two lyrics was generated 
(augmented) by RAPFORMER, and which was writ- 
ten by a human. For the selected outputs, two of 
the three raters incorrectly guessed that the lyrics 
generated by RAPFORMER were actually human- 
created. 


8 Conclusion 


We have proposed a novel approach to generation 
of rap verses conditioned on a list of content words. 
We showed that our method is capable of generat- 
ing coherent and technically fluent synthetic verses 
using diverse text types as input, including news ar- 
ticles, movie plot summaries, or original rap verses. 
The fluency of our outputs is further improved 
through a novel rhyme enhancement step. Our 
approach is particularly effective when rephrasing 
the content of existing rap lyrics in novel ways, 
making it a potentially useful tool for creative writ- 
ers wishing to explore alternative expressions of 
their ideas. 

The generality of our approach to conditional 
text generation makes it applicable to generation 
of creative texts in other domains, such as poetry 
or short stories. Future work could explore other 
approaches to extracting content words, including 
combining several stripping approaches, and could 
explore the utility of large-scale pretrained models
(e.g., Raffel et al., 2019; Lewis et al., 2020)
for this task. Another direction is to extend our

Question 45 of 100 

LYRICS (A) 

waka waka: 

they say na blind eye, take it far 

i’ve got it on my own, my own 

oche num, oda du, doka dum so 

if anybody ever try go shoot the almighty 
blazing so amazing 


Which of these lyrics was written by a human? 


LYRICS (B) 

i say na correct eye i take waka this waka 
but after i’ve got you i blind pata pata 
oche du no dum no oda du num doka 
anybody try you i go shoot the murderfker 
ever blazing you amazing 


Correct answer: (B) 


Question 72 of 100 

LYRICS (A) 

vegas on the third floor, like lamar with the cardio 
fascinated by the cars smokin’ dope in the casino 
despise the propaganda rise, higher 

mac-11 camouflage for example, that’s why i never set fires 
i walk with a flame that never match my desires 
take a pic, cause the pain is higher 

i’m rich as a coupe, light it up with kelly 

phone sucker, my friend, it’s a blessing 

benz, plaques, wall, and g6’s 

- ’em all, hustler say the victim 

ciroc and bel air - 

april -’s -, her name so 


Which of these lyrics was written by a human? 


LYRICS (B) 

out in vegas like lamar, third floor tropicana 
fascinated with the cars, smokin’ dope in the phantom 
teflon’s on the rise, 1 despise propaganda 
camouflage mac-11, i should set an example 
never baptized, as i walk through the fires 

the pain and the flame never match my desires 
crucified cause i’m rich, in the coupe, take a pic 
on the phone at the light, kelly rowland’s a friend 
catfish in the benz, manti teo’s a sucker 

plaques on the wall, hustler so i can say ”- ’em” 
bel air for the -, ciroc in the pool 

my - is a -, her name is april’s a fool 


Correct answer: (B) 


Question 74 of 100 

LYRICS (A) 

she cut the call when she was on ma phone 
when you picked up the line 

you got so mad and asked me who’s the girl 
i’m sleeping with behind 

baby, i had no words to say 

so i guess i will try 

not to lie... it’s the time... 


Which of these lyrics was written by a human? 


LYRICS (B) 

i picked up the phone and cut the line and call 
i asked what’s up girl, why you got so long 
i’m sleeping behind you 

baby, i guess i try to say the truth 

but... it’s time to lie... 


Correct answer: (A) 


Table 8: Examples of lyrics generated by RAPFORMER that fooled the majority (at least two out of three) human 
raters in a side-by-side comparison with human created lyrics. Inappropriate words are replaced by a single dash. 


work to end-to-end generation with an integrated 
rhyming loss function, which could potentially be 
tackled using reinforcement learning (Luo et al., 
2019). Moreover, the task of generating coherent 
lyrics from a set of content words could be natu- 
rally modeled as a text-editing task (Dong et al., 
2019; Mallinson et al., 2020; Malmi et al., 2019) 
instead of a sequence-to-sequence task. 
A Additional Model Outputs


In Tables 9, 10 and 11 we display a few additional 
manually selected model outputs for each of our 
input domains (rap lyrics, movie summaries and 
news article summaries) and each of our stripping 
approaches (SHUFFLE (RAPFORMER), DROP, and 
SYNONYM). 
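
To make the three stripping variants more concrete, the following is a rough sketch of what such noising operations on a list of content words could look like. The function names, the drop and replacement probabilities, and the toy synonym table are assumptions for illustration only; they do not reproduce our exact implementation or settings.

```python
import random


def shuffle_words(words, rng):
    """SHUFFLE-style noise: return the content words in a random order."""
    out = list(words)
    rng.shuffle(out)
    return out


def drop_words(words, rng, p_drop=0.2):
    """DROP-style noise: remove each word independently with probability p_drop.

    The value of p_drop is a hypothetical default.
    """
    return [w for w in words if rng.random() >= p_drop]


def synonym_words(words, rng, synonyms, p_replace=0.2):
    """SYNONYM-style noise: swap some words for a synonym, if one is known.

    `synonyms` is a hypothetical lookup table (word -> list of alternatives).
    """
    out = []
    for w in words:
        options = synonyms.get(w)
        if options and rng.random() < p_replace:
            out.append(rng.choice(options))
        else:
            out.append(w)
    return out


rng = random.Random(0)
content = ["control", "whole", "domain", "show", "shame"]
toy_synonyms = {"domain": ["territory"], "shame": ["disgrace"]}

print(shuffle_words(content, rng))
print(drop_words(content, rng))
print(synonym_words(content, rng, toy_synonyms))
```

Each function returns a new list and leaves its input untouched, so the same content words can be passed through several variants for comparison.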


B Demo Song 


We generated lyrics for a demo song by using the
abstract of this paper as the input to RAPFORMER.
We generated multiple samples by reshuffling the
content words of the abstract multiple times. We
sent all sample lyrics to a rap artist and asked
them to record a song using a subset of those lyrics.
We allowed for re-arranging and deletion, but no
addition of human-created lyrics. The resulting
audio file is included in the supplementary material²,
while the final lyrics of the song are in Table 12.
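
The reshuffling step can be sketched as follows. This is a minimal illustration under stated assumptions: the content words are taken as an already extracted list, and the function name, the seed, and the example words are hypothetical. Each reshuffled list would then be passed to the model as a separate input.

```python
import random


def reshuffled_inputs(content_words, n_samples, seed=0):
    """Return n_samples independently shuffled copies of the content words."""
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    samples = []
    for _ in range(n_samples):
        shuffled = list(content_words)  # copy, so the input list is not modified
        rng.shuffle(shuffled)
        samples.append(shuffled)
    return samples


# Hypothetical content words standing in for those extracted from the abstract.
words = ["symbols", "language", "rap", "verse", "autoencoder", "rhyme"]
for candidate in reshuffled_inputs(words, n_samples=3):
    print(" ".join(candidate))
```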

We also tested the recently released Jukebox
algorithm (Dhariwal et al., 2020) for end-to-end
synthesis of a rap song conditioned on the abstract
lyrics. However, our preliminary results were
unsatisfactory, since individual words could not be
told apart in the generated audio.


² Also available at https://bit.ly/3kXGItD.


INPUT: control the whole domain , and then show no shame 
make rappers go ? and they so lame , playing with no game 
put em on the lil plane til they can ’t claim no fame 

i got , the range , better , stay in the slow lane 

i make the flow change from hurricanes to a slow rain 

your thoughts are so plane , i rearrange your whole frame 
until my whole name grow out your brain like rogaine 


DROP + RE: 

like propane in control, my whole objective show 

no shame, i can make it playin’ the game 

put em in a lil’ game, til they don’t claim fame range 
you better stay in your lane, and make the flow change 
i be on a plane to rearrange your own frame 

your whole name grow in my brain, rogaine letter 


SYNONYM + RE: control the whole domain just to show it no shame 
make these rappers go lame, they playin’ the game 

put em on a lil’ plane til they can’t title the fame 

if you got a range, you better stay in a slow lane 

name run change hurricane slow rain 

thoughts on the ground and rearrange your whole frame 

whole name grow brain like a rogaine cane 


SHUFFLE + RE (RAPFORMER): aim dough like propane came with a whole blow 
shame how you control the whole show 

lame rappers playin’ the game, make your domain go 

can’t claim em til you put a lil fame on a show 

slow down, stay in your lane, got a better range 

make it rain, change your flow, slow rain 

rearrange your whole frame, plane change 

grow like rogaine, put my whole name on your brain 


INPUT: now comes our turn 
we step in, you burn 

you got something to learn 
and bring a breath of fresh air 
this is how we go 

you check this mode 

ride da best you know 

and still want more 


DROP + RE: when it comes to turn to step
i got something to bring in my breath 

go check my mode 

ride with da so 

you know i still want to be your own 


SYNONYM + RE: when it comes to turn my step to burn 
i got something to learn 

bring your breath to new light 

go check the way i ride 

da best i know 

you still want so 


SHUFFLE + RE (RAPFORMER): when it comes to my turn 
step up and learn something to burn 

i got fresh air on my breath 

go check da best mode, bring it yeah 

you know i still want a ride 


Table 9: Additional model outputs for rap reconstruction. 
INPUT: hoping to improve his financial lot , petty thief hawk chovinski hires a dancing instructor to teach him how to
bear himself like a gentleman . his lessons completed , hawk then poses as a european nobleman , intending to trap a 
wealthy wife . yolande cowles sees through hawk ’ s pose but falls in love with him anyway . 


DROP + RE: i improve a grizzly lot of petty thief times
dancing in the middle of the night 

i am the man who can teach you how to bear it 

like a gentleman with diamonds 

i’m a superheroic, i can be your wife 

yolande cowles tonight 

falls in love anyway 


SYNONYM + RE: hoping that you can improve 
a financial lot of petty use 
mortarboard chovinski engage 
dancing with the snake 

teach her how to settle 

like a gentleman 

lessons are shackled by a bullet 
sit in european 

imagine 

in the trap with a wealthy wife 
yolande hood sees the sky 
when the pose falls in line 
anyway, no, not me 


SHUFFLE + RE (RAPFORMER): you teach me petty dancing like bear thief 
chovinski, intersect, be 
a lot of financial gentleman hoping he can improve somebody 
wife, nobleman, the trap is so polished 
wealthy hawk lessons, european hawk lessons 
yolande cowles anyway, sees him pose when he says 
hawk love! 


INPUT: the film follows the lives of several west point cadet classmates who find themselves on opposite sides of the
war . the film also follows the adventures of lucius the slave escaping via the underground railroad to freedom with the
film cutting between the first battle of bull run and the birth of a lucius ’ child born in slavery .


DROP + RE: film of the west point where they can find the opposite sides of ours
film also and they will be a slave escaping me from the underground, 

and we will not be the same if we are not the maker 

this is a film cutting first bull from birth to child’s slaver. 


SYNONYM + RE: film to succeed our lives in several zones 
our head is the most likely to find our own 

we are not the same as the other side of ever 

film also follows adventure 

the lucius slave, the escaping via underground 

motorical, freedom, film out 

first battle bull, then feed him birth 

golden child, born in order 


SHUFFLE + RE (RAPFORMER): this is the opposite of war follows lives on both sides 
several point film from the west to the light 

find a child born escaping via immigration 

film the underground cutting off the film of all the complications 

slave, run from lucius slavery 

battle of freedom and liberty 

also the first bull follows luc-up! 


Table 10: Additional model outputs for style transfer from movie plot summaries to rap lyrics. 
INPUT (NEWS): temperatures dipped into the mid-30s during 4 days man lay in woods of philadelphia park . mom told
police son was with her in maryland , but he was found friday with blanket , bible . victim being treated for malnutrition , 
dehydration ; mother faces host of charges after extradition . 


DROP + RE: i’ve been dipped for days, lay in woods

in the park with the crook 

my son from pittsburgh found friday blanket, bible victim 
i was born to be a victim of my reality 

with no faces 

host charges, i had to do it everything 


SYNONYM + RE: dipped in mid-a.t. days, man i dwell in ways 
UNK 3: gs, park mom told the police son that he was from illegal 
found him on friday, he had a blanket with a bible 

a victim of how he treated him as a 


SHUFFLE + RE (RAPFORMER): man i was dipped up in a lay up with some of them from an old 
mid-30s days in the home 
mom told me to be in michigan anyway 
police blanket friday 
i found my son a bible 
he was a host for the trial 
my mother treated him as an alien 
a victim of faces 
INPUT (NEWS): jordan spieth won the 2015 masters by four shots on sunday . the 21-year-old american led all week at 
the augusta national golf club . he shot final-round 70 to finish on 18 under par and take the green jacket . 


DROP + RE: to all of my masters,

four sunday morning, american led alert 
national golf club, final-round time 

take a green jacket 


SYNONYM + RE: jordan, we are not the same, no masters! 
four shots of the sun, the laughter 

we were the most likely american led in a week 

at the first club shot last finish, hey 

get the green cap 


SHUFFLE + RE (RAPFORMER): masters, four shots on sunday 
jordan, led me to the national club, the american way 
golf week, green dine, par 
finish my jacket, take my final-round start 

INPUT (NEWS): the dallas native will play alongside justin rose in the final pairing . has set a scoring record for the first 
54 holes of 16 under par . finished runner-up last year and is now determined to win . is first player since greg norman in 
1996 to have lead after each round . 


DROP + RE: dallas native play
i was born to be a slave 

but now i’m on my own 

and i’ll be the first so 

justin final scoring holes in par 
last year determined to start 
been a player, since greg 

the only way to tell 


SYNONYM + RE: dallas, c4, i play with the same 

g6, justin rose to the place 

cl, ready to scoring the record 

first holes in the firearm, then i remember 

this is the first year 

determined to win, first player, since marc ellen went here 


SHUFFLE + RE (RAPFORMER): justin rose, native gold 
final par, scoring holes, set it off, play it again, justin rose 
determined to win the first record, last year i was finished 
greg player, he was a player from the beginning 

since first i lead the worldball. 


Table 11: Additional model outputs for style transfer from news articles to rap lyrics. 
[intro]

i am the oldest
the lyrics they just follow orders.
i am the oldest
the lyrics they just follow orders.
good trade-off of your style.
i am the oldest
the lyrics they just follow orders.

i rhyme more rhymes and moreover 
move over I’m recording 


[verse 1] 

another verse written on the news of rap methods, 
given to me in the form of an autoencoder 

to develop the words that i rap, and i will be denoting 
in my text, i am the only content, 

i can be the same as an automatist, 

i train rap lyrics to study different meaning when i approach words as i am, 
I train lyrics that are the most definitive, 

more essential than a scheme of three 

more untouchable than an underflow 

move over. pirana, the founder, moreover. 

my rhyme lyrics are more than the rhyme over 

(when i develop a verse) 


[verse 2] 

when i develop a verse i form a text from an art that is written on the news of an autoencoder rap 
another method given to a train that i have been through and i am not the only thing to do with 
this is my reality 

i will not be content with rap lyrics i approach with the meaning oh 

my words are based on my attack. 

my lyrics are essential as I generate rap. 

my average rhyme scheme is to show you different content 

in other words, i can’t study my own admirations. 

my raps are so amazing 

the rhyme is paraphrasing. 


[bridge] 

my results are very good like I’m a human being 
my rap is in the convoy. 

your lyrics will be so pre-dated. 

(when i develop a verse) 


[outro] 
I’m a human being 
I’m a human being 


Table 12: Lyrics of our demo song, described in Appendix B. 